Multi-armed Bandit Problems with History

Authors

  • Pannagadatta K. Shivaswamy
  • Thorsten Joachims
Abstract

In a multi-armed bandit problem, at each time step an algorithm chooses one of the available arms and observes its reward. The goal is to maximize the sum of rewards over all time steps (equivalently, to minimize the regret). In the conventional formulation of the problem, the algorithm has no prior knowledge about the arms. Many applications, however, provide some data about the arms even before the algorithm starts. For example, a search engine company may have obtained data on its newly developed retrieval functions from a small sample of paid users. The availability of such historic data raises the question of how online learning algorithms can best use it to reduce regret. This problem is meaningful only for the case of stochastic arms [1]. We propose algorithms and show that a logarithmic amount of historic data allows them to achieve constant regret.

The work of [2] assumes that historic data collected via some policy is available to evaluate a mapping from side information to arms. In the absence of side information, their policy-evaluation strategy reduces to choosing the arm with the highest mean reward on the historic data.

In a K-armed stochastic bandit problem, the random variable X_{j,t} ∈ [0, 1] (1 ≤ j ≤ K, t ≥ 1) denotes the reward incurred when arm j is pulled for the t-th time. For each arm j, the rewards X_{j,t} are iid with mean μ_j and variance σ_j². The best arm is denoted by j*, i.e., μ_{j*} := max_{1≤i≤K} μ_i. Historic data is denoted by X^h_{j,t} ∈ [0, 1] for 1 ≤ j ≤ K and 1 ≤ t ≤ H_j; the historic rewards for each arm are assumed to be drawn iid as well. T_j(n) denotes the number of times arm j is pulled between times 1 and n (this excludes the pulls of the arm in the historic data). The regret at time n is defined as REG(n) := μ_{j*}·n − ∑_{j=1}^{K} μ_j E[T_j(n)], and the per-round regret at time n as REG(n)/n. The mean reward of arm j over its first n online pulls is denoted X̄_{j,n}; analogously, the joint mean reward of arm j over both the historic and the online data is denoted X̃_{j,n}.
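To make the warm-start idea concrete, here is a minimal Python sketch: a UCB1-style index policy whose per-arm statistics are seeded with the historic rewards X^h_{j,t} before any online pull. This is an illustration under stated assumptions, not the paper's exact algorithm; the names `ucb_with_history` and `pull` and the particular confidence radius are choices made for the example.

```python
import math
import random

def ucb_with_history(pull, K, n, historic):
    """UCB1-style policy warm-started with historic data.

    pull(j)     -> one online reward in [0, 1] for arm j
    historic[j] -> list of historic rewards X^h_{j,t} for arm j (length H_j)
    Returns T_j(n), the number of *online* pulls per arm, matching the
    regret definition above (historic pulls are excluded).
    """
    counts = [len(h) for h in historic]       # joint pull counts, start at H_j
    sums = [float(sum(h)) for h in historic]  # joint reward sums
    online_pulls = [0] * K                    # T_j(n): online pulls only

    for _ in range(n):
        untried = [j for j in range(K) if counts[j] == 0]
        if untried:
            j = untried[0]  # an arm with no data at all is pulled once first
        else:
            total = sum(counts)
            # Index = joint mean plus a confidence radius that shrinks with
            # H_j + T_j: arms well covered by history need little or no
            # online exploration.
            j = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(total) / counts[i]))
        reward = pull(j)
        counts[j] += 1
        sums[j] += reward
        online_pulls[j] += 1
    return online_pulls

# Usage on two hypothetical Bernoulli arms (means 0.6 and 0.5) with
# H_j = 20 historic draws per arm:
mus = [0.6, 0.5]
hist = [[float(random.random() < m) for _ in range(20)] for m in mus]
T = ucb_with_history(lambda j: float(random.random() < mus[j]), 2, 1000, hist)
```

Seeding counts[j] with H_j tightens arm j's confidence interval from the first online round, which is the mechanism behind the constant-regret claim: once the history is large enough (logarithmic in n), the usual Θ(log n) online exploration cost disappears.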

Similar Articles

Cognitive Capacity and Choice under Uncertainty: Human Experiments of Two-armed Bandit Problems

The two-armed bandit problem, or more generally the multi-armed bandit problem, has been identified as the underlying problem in many practical circumstances that involve making a series of choices among uncertain alternatives. Problems like job searching, customer switching, and even the adoption of fundamental or technical trading strategies by traders in financial markets can be formulated...

The Irrevocable Multiarmed Bandit Problem

This paper considers the multi-armed bandit problem with multiple simultaneous arm pulls and the additional restriction that we do not allow recourse to arms that were pulled at some point in the past but then discarded. This additional restriction is highly desirable from an operational perspective, and we refer to this problem as the ‘Irrevocable Multi-Armed Bandit’ problem. We observe that na...

Enhancing Evolutionary Optimization in Uncertain Environments by Allocating Evaluations via Multi-armed Bandit Algorithms

Optimization problems with uncertain fitness functions are common in the real world and present unique challenges for evolutionary optimization approaches. Existing issues include excessively expensive evaluation, a lack of solution reliability, and an inability to maintain high overall fitness during optimization. Using conversion rate optimization as an example, this paper proposes a series...

Multi-Armed Bandit Policies for Reputation Systems

The robustness of reputation systems against manipulation has been widely studied. However, the study of how to use the reputation values computed by those systems is rare. In this paper, we draw an analogy between reputation systems and multi-armed bandit problems. We investigate how to use multi-armed bandit selection policies in order to increase the robustness of reputation systems ...

Learning to Play K-armed Bandit Problems

We propose a learning approach to pre-compute K-armed bandit playing policies by exploiting prior information describing the class of problems targeted by the player. Our algorithm first samples a set of K-armed bandit problems from the given prior, and then chooses, from a space of candidate policies, the one that gives the best average performance over these problems. The candidate policies use an i...

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration–exploitation trade-off: the balance between staying with the option that gave the highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration–exploitation trade-offs aris...


Published: 2012